{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d755f3e0-de66-4ecf-83d0-535310d13e59",
   "metadata": {},
   "source": [
    "# LangChain RAG with Locall LLM Tutorial\n",
    "\n",
    "This is based on Pixegami's tutorial. ([original repo](https://github.com/pixegami/rag-tutorial-v2/))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3dc8c87-e26e-4064-805e-1d7e6058f8c4",
   "metadata": {},
   "source": [
    "## Loading The Data\n",
    "\n",
    "More on Document loaders --> [official document](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be0c3b2f-e107-4166-ba6b-970e10b33e20",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/pixegami/rag-tutorial-v2/\n",
    "!mv rag-tutorial-v2/data ./\n",
    "!rm -rf rag-tutorial-v2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "71e2a607-33c2-4efa-b28b-2b21889c3c92",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.document_loaders.pdf import PyPDFDirectoryLoader\n",
    "\n",
    "DATA_PATH = \"data\"\n",
    "\n",
    "def load_documents():\n",
    "    document_loader = PyPDFDirectoryLoader(DATA_PATH)\n",
    "    return document_loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a8a71ca0-aea3-4d81-a081-1efe54927609",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n",
      "page_content='On a blustery autumn evening five old friends met in the backroom of one of the city’s oldest and most private clubs. Each had\\ntraveled a long distance — from all corners of the world — to meet on this very specific day… October 2, 1900 — 28 years to the\\nday that the London eccentric, Phileas Fogg accepted and then won a £20,000 bet that he could travel Around the World in 80 Days . \\nWhen the story of Fogg’s triumphant journey filled all the newspapers of the day, the five attended University together. Inspired by\\nhis impetuous gamble, and a few pints from the local pub, the group commemorated his circumnavigation with a more modest excur-sion and wager – a bottle of good claret to the first to make it to Le Procope in Paris.\\nEach succeeding year, they met to celebrate the anniversary and pay tribute to Fogg. And each year a new expedition (always mor e\\ndifficult) with a new wager (always more expensive) was proposed. Now at the dawn of the century it was time for a new impossi-ble journey. The stakes: $1 Million in a winner-takes-all competition. The objective: to see which of them could travel by rail to themost cities in North America — in just 7 days. The journey would begin immediately…\\nTicket to Ride is a cross-country train adventure. Players compete to connect different cities by laying claim to railway routes on a\\nmap of North America.\\nFor 2 - 5 players \\nages 8 and above\\n30 - 60 minutes[T2R] rules EN reprint 2015_TTR2 rules US  06/03/15  17:36  Page2' metadata={'source': 'data/ticket_to_ride.pdf', 'page': 0}\n"
     ]
    }
   ],
   "source": [
    "documents = load_documents()\n",
    "print(len(documents))\n",
    "print(documents[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a369d229-ad09-49db-aa6b-b5ea62db3382",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Split The Documents "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "eec69272-6fec-46ea-8310-8b2a87cd32f2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "from langchain.schema.document import Document\n",
    "\n",
    "def split_documents(documents: list[Document]):\n",
    "    text_splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=800,\n",
    "        chunk_overlap=80,\n",
    "        length_function=len,\n",
    "        is_separator_regex=False,\n",
    "    )\n",
    "    return text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f02190a6-4478-4eb1-ae7a-ef6888d4375e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "41\n",
      "page_content='On a blustery autumn evening five old friends met in the backroom of one of the city’s oldest and most private clubs. Each had\\ntraveled a long distance — from all corners of the world — to meet on this very specific day… October 2, 1900 — 28 years to the\\nday that the London eccentric, Phileas Fogg accepted and then won a £20,000 bet that he could travel Around the World in 80 Days . \\nWhen the story of Fogg’s triumphant journey filled all the newspapers of the day, the five attended University together. Inspired by\\nhis impetuous gamble, and a few pints from the local pub, the group commemorated his circumnavigation with a more modest excur-sion and wager – a bottle of good claret to the first to make it to Le Procope in Paris.' metadata={'source': 'data/ticket_to_ride.pdf', 'page': 0}\n"
     ]
    }
   ],
   "source": [
    "chunks = split_documents(documents)\n",
    "print(len(chunks))\n",
    "print(chunks[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e42e848c-f668-437c-a6ce-72dbad72734b",
   "metadata": {},
   "source": [
    "## Embedding Function\n",
    "\n",
    "### Using Cloud Embedding Function\n",
    "\n",
    "Cloud hosted embedding function generally performs better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dec75fb5-c584-468d-97f0-5bec6cb5650a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_community.embeddings.bedrock import BedrockEmbeddings\n",
    "\n",
    "def get_embedding_function():\n",
    "    embeddings = BedrockEmbeddings(\n",
    "        credentials_profile_name=\"default\", region_name=\"us-east-1\"\n",
    "    )\n",
    "    # embeddings = OllamaEmbeddings(model=\"nomic-embed-text\")\n",
    "    return embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "143da522-dbda-425e-a3f8-d0f34a6c7f5a",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Running **Local** Embedding Function\n",
    "\n",
    "In case you want to run the embedding function locally as well, run the following cell instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61ee6926-62e9-4eae-bb4d-75077b3f6940",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!ollama list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8eb84fdd-6129-413c-9f86-63361bc15327",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ollama pull nomic-embed-text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8cc0274e-3958-4ae0-9f61-659ee63c3167",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_community.embeddings.ollama import OllamaEmbeddings\n",
    "\n",
    "def get_embedding_function():\n",
    "    embeddings = OllamaEmbeddings(model=\"nomic-embed-text\")\n",
    "    return embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c517df9e-9e60-4234-bd66-65cddcec8567",
   "metadata": {},
   "source": [
    "## Creating The Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3a65fbe2-afb6-4357-a8f7-e131eb1d1f7c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.vectorstores.chroma import Chroma\n",
    "\n",
    "CHROMA_PATH = \"chroma\"\n",
    "\n",
    "def add_to_chroma(chunks: list[Document]):\n",
    "    # Load the existing database.\n",
    "    db = Chroma(\n",
    "        persist_directory=CHROMA_PATH, embedding_function=get_embedding_function()\n",
    "    )\n",
    "\n",
    "    # Calculate Page IDs.\n",
    "    chunks_with_ids = calculate_chunk_ids(chunks)\n",
    "\n",
    "    # Add or Update the documents.\n",
    "    existing_items = db.get(include=[])  # IDs are always included by default\n",
    "    existing_ids = set(existing_items[\"ids\"])\n",
    "    print(f\"Number of existing documents in DB: {len(existing_ids)}\")\n",
    "\n",
    "    # Only add documents that don't exist in the DB.\n",
    "    new_chunks = []\n",
    "    for chunk in chunks_with_ids:\n",
    "        if chunk.metadata[\"id\"] not in existing_ids:\n",
    "            new_chunks.append(chunk)\n",
    "\n",
    "    if len(new_chunks):\n",
    "        print(f\"👉 Adding new documents: {len(new_chunks)}\")\n",
    "        new_chunk_ids = [chunk.metadata[\"id\"] for chunk in new_chunks]\n",
    "        db.add_documents(new_chunks, ids=new_chunk_ids)\n",
    "        db.persist()\n",
    "    else:\n",
    "        print(\"✅ No new documents to add\")\n",
    "        \n",
    "\n",
    "def calculate_chunk_ids(chunks):\n",
    "\n",
    "    # This will create IDs like \"data/monopoly.pdf:6:2\"\n",
    "    # Page Source : Page Number : Chunk Index\n",
    "\n",
    "    last_page_id = None\n",
    "    current_chunk_index = 0\n",
    "\n",
    "    for chunk in chunks:\n",
    "        source = chunk.metadata.get(\"source\")\n",
    "        page = chunk.metadata.get(\"page\")\n",
    "        current_page_id = f\"{source}:{page}\"\n",
    "\n",
    "        # If the page ID is the same as the last one, increment the index.\n",
    "        if current_page_id == last_page_id:\n",
    "            current_chunk_index += 1\n",
    "        else:\n",
    "            current_chunk_index = 0\n",
    "\n",
    "        # Calculate the chunk ID.\n",
    "        chunk_id = f\"{current_page_id}:{current_chunk_index}\"\n",
    "        last_page_id = current_page_id\n",
    "\n",
    "        # Add it to the page meta-data.\n",
    "        chunk.metadata[\"id\"] = chunk_id\n",
    "\n",
    "    return chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "535c3043-d0f9-4ba0-953d-43e1ebbe4abf",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of existing documents in DB: 41\n",
      "✅ No new documents to add\n"
     ]
    }
   ],
   "source": [
    "add_to_chroma(chunks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ac825264-bf17-4efb-9851-0abdca61e2ac",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.0K\t./chroma/ef219065-076e-496e-be5b-b3fa0deb744a/length.bin\n",
      "4.0K\t./chroma/ef219065-076e-496e-be5b-b3fa0deb744a/header.bin\n",
      "3.1M\t./chroma/ef219065-076e-496e-be5b-b3fa0deb744a/data_level0.bin\n",
      "0\t./chroma/ef219065-076e-496e-be5b-b3fa0deb744a/link_lists.bin\n",
      "3.1M\t./chroma/ef219065-076e-496e-be5b-b3fa0deb744a\n",
      "552K\t./chroma/chroma.sqlite3\n",
      "3.7M\t./chroma\n"
     ]
    }
   ],
   "source": [
    "!du -ah ./chroma"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e9626c3-db95-4eed-a0b4-cfd574b370c6",
   "metadata": {},
   "source": [
    "## Running RAG Query Locally"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8337451-c6b3-43a1-8d43-bc32b0d339c9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain_community.llms.ollama import Ollama\n",
    "\n",
    "CHROMA_PATH = \"chroma\"\n",
    "\n",
    "PROMPT_TEMPLATE = \"\"\"\n",
    "Answer the question based only on the following context:\n",
    "{context}\n",
    "\n",
    "---\n",
    "Answer the question based on the above context: {question}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e863e782-48ca-4e18-9935-724983dc69b6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "embedding_function = get_embedding_function()\n",
    "db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6e445ec-4c87-4418-90a2-0acffda2369d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_text=\"How many clues can I give in Codenames?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "414d26fc-acfe-4f1f-8bda-83702eff3f88",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Search the DB.\n",
    "results = db.similarity_search_with_score(query_text, k=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e7a977c-591b-4b63-9fd2-5f8be3c8b357",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "context_text = \"\\n\\n---\\n\\n\".join([doc.page_content for doc, _score in results])\n",
    "prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)\n",
    "prompt = prompt_template.format(context=context_text, question=query_text)\n",
    "print(prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02f9e48f-c7fc-46c2-aa39-3eb946eb39b0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "model = Ollama(model=\"mistral\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28c1184b-cde8-4088-b1b5-dcbea7b6dc69",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "response_text = model.invoke(prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4937264-b2fe-4fae-b7cf-729618edf345",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(response_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c662d8a-07a6-4744-a9b5-4d2e16173a74",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sources = [doc.metadata.get(\"id\", None) for doc, _score in results]\n",
    "formatted_response = f\"Response: {response_text}\\nSources: {sources}\"\n",
    "print(formatted_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e6fcc4e-8750-43a9-9082-b9f3a43eab68",
   "metadata": {},
   "source": [
    "### Function-ized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "fd794263-718d-4bf9-9ce8-796952ba0b0c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain_community.llms.ollama import Ollama\n",
    "\n",
    "CHROMA_PATH = \"chroma\"\n",
    "\n",
    "PROMPT_TEMPLATE = \"\"\"\n",
    "Answer the question based only on the following context:\n",
    "{context}\n",
    "\n",
    "---\n",
    "Answer the question based on the above context: {question}\n",
    "\"\"\"\n",
    "\n",
    "def query_rag(query_text: str):\n",
    "    # Prepare the DB.\n",
    "    embedding_function = get_embedding_function()\n",
    "    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)\n",
    "\n",
    "    # Search the DB.\n",
    "    results = db.similarity_search_with_score(query_text, k=5)\n",
    "\n",
    "    context_text = \"\\n\\n---\\n\\n\".join([doc.page_content for doc, _score in results])\n",
    "    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)\n",
    "    prompt = prompt_template.format(context=context_text, question=query_text)\n",
    "    # print(prompt)\n",
    "\n",
    "    model = Ollama(model=\"mistral\")\n",
    "    response_text = model.invoke(prompt)\n",
    "\n",
    "    sources = [doc.metadata.get(\"id\", None) for doc, _score in results]\n",
    "    formatted_response = f\"Response: {response_text}\\nSources: {sources}\"\n",
    "    print(formatted_response)\n",
    "    return response_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "c3598662-a4dc-4b49-8815-a36c3c34425d",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_text=\"How do I get out of jail in Monopoly?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "3c648863-0ad7-489b-aa33-a57f020e2510",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response:  To get out of jail in Monopoly, you have several options: (1) throw doubles on any of your next three turns, (2) use a \"Get Out of Jail Free\" card if you have one, (3) purchase a \"Get Out of Jail Free\" card from another player and play it, or (4) pay a fine of $50 before rolling the dice on either of your next two turns. If you do not throw doubles by your third turn, you must pay the $50 fine and move forward the number of spaces shown by your next roll. You can also buy and sell property, buy and sell houses and hotels, and collect rents while in jail. However, you cannot collect your $200 salary if sent to jail during a move.\n",
      "Sources: ['data/monopoly.pdf:1:0', 'data/monopoly.pdf:0:0', 'data/monopoly.pdf:4:1', 'data/monopoly.pdf:1:1', 'data/monopoly.pdf:4:2']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "' To get out of jail in Monopoly, you have several options: (1) throw doubles on any of your next three turns, (2) use a \"Get Out of Jail Free\" card if you have one, (3) purchase a \"Get Out of Jail Free\" card from another player and play it, or (4) pay a fine of $50 before rolling the dice on either of your next two turns. If you do not throw doubles by your third turn, you must pay the $50 fine and move forward the number of spaces shown by your next roll. You can also buy and sell property, buy and sell houses and hotels, and collect rents while in jail. However, you cannot collect your $200 salary if sent to jail during a move.'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_rag(query_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97286027-0b01-44db-a8ae-665566f53984",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_text=\"How many clues can I give in Codenames?\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
